The International Monetary Fund (IMF or ‘the Fund’) is a global organisation of 190 countries that was set up to promote monetary cooperation, financial stability, international trade, economic reform and poverty reduction. It is funded by member countries, who can draw on the Fund’s resources if they encounter financial difficulties (IMF 2022). The IMF also maintains a permanent and visiting staff focused on economic research. One of the IMF’s flagship research publications is the Working Paper series, for which online records are available back to the early 1990s.
We analyse the Fund’s research with respect to the Working Paper series. We use a topic modelling approach built on the embeddings from the popular BERT transformer architecture. Overall, the topic model produces a set of detailed, coherent topics that summarise IMF research over the past 30 years. These topics are robust to several variations of the Working Paper corpus (using paper summaries, titles and subject tags). Many topics relate to issues firmly within the IMF’s remit, such as exchange rates, monetary policy, economic growth and development, and reform (e.g. related to labour markets, demography and pensions). However, over the past 10 years contemporary social issues have also gained prominence in IMF research. This is reflected in a surge in research about topics like inequality, climate change and, more recently, COVID-19, to the extent that these are some of the most popular topics of IMF research over the past 30 years.
Of course our analysis only scratches the surface of the information the topic model has summarised from the Working Papers. To encourage the reader to explore the outputs and reach their own conclusions, we have included interactive charts in this report where possible.
There is a growing literature that uses text mining techniques to analyse the content of economic and news publications. A detailed review of the literature can be found in Avetisyan (2021).
One strand of literature uses the frequency of certain types of words to construct economic indices. A prominent example is the policy uncertainty index, which matches the occurrence of a pre-defined set of words in news articles (Baker, Bloom, and Davis 2016). Other examples are related to constructing economic sentiment indices through counting the frequency of words appearing in a large dictionary of positive and negative terms (Nguyen, La Cava, and others 2020). The benefit of these approaches is that they are transparent and simple to understand. However, they are also inflexible in the sense that the ‘algorithm’ underlying the analysis is pre-defined and doesn’t actually learn anything from the data.
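As a stylised illustration of the dictionary approach, a sentiment index can be computed by counting matches against positive and negative word lists. The lists below are toy placeholders, not the dictionaries actually used in the literature:

```python
# Toy sketch of a dictionary-based sentiment index. The word lists are
# illustrative placeholders; real applications use large curated dictionaries.
import re

POSITIVE = {"growth", "improvement", "strong", "recovery"}
NEGATIVE = {"crisis", "decline", "weak", "uncertainty"}

def sentiment_score(text: str) -> float:
    """Net sentiment: (positive count - negative count) / total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)
```

The ‘algorithm’ here is entirely pre-defined, which illustrates the transparency of the approach and, equally, why nothing is learned from the data.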
Another strand of literature uses unsupervised methods, such as latent semantic analysis (LSA) or latent Dirichlet allocation (LDA), to assess the content of economic statements and relate this to macroeconomic data. Much of this analysis has focused on central bank communications, especially for the US Federal Reserve. For example, Hansen, McMahon, and Prat (2018) use LDA topic modeling to assess how transparency reforms have affected the content of FOMC deliberations. Battaglia and Salunina (2020) use a dynamic LDA topic model to construct topic representations that can be used as an input to forecast macroeconomic conditions. Unsupervised methods such as LDA are useful because they ‘learn’ their parameters from otherwise unstructured text data. These methods have two main downsides though. One is that they use a bag of words representation of documents, which ignores the (obviously relevant) ordering of words. The other is that their outputs are highly sensitive to the tuning of hyperparameters (including the number of topics) and also how the text is pre-processed.
Related to the IMF, there is a small literature that employs text mining techniques. For instance, Anderson et al. (2021) use clustering and dimensionality reduction (PCA) on IMF communiques and constituency statements to analyse how the priorities of IMF members have changed over time. In addition, Mihalyi and Mate (2019) analyse IMF Article IV reports (country level reports and recommendations) to assess the nature of IMF country surveillance, while IMF (2019) analyse the IMF’s social spending surveillance. However, to our knowledge, there is no research to date that looks at the content of IMF publications from a topic modelling perspective.
We build our corpus by web-scraping the official IMF website. The scraper uses the Selenium package in Python and operates in two steps. First, it cycles through each listing page of working papers and collects the URL for each working paper. Then, it opens each URL and appends all of the metadata on that page to a data frame. The result is a data frame containing the metadata for every working paper. The code underlying the scraper is attached and could easily be extended to download the PDF for each working paper, though we did not have a use for this.
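The two-step logic can be sketched as follows. The listing-page URL pattern and CSS selectors here are illustrative assumptions, not the IMF site's actual markup:

```python
# Sketch of the two-step scrape (URL pattern and selectors are assumptions).
BASE = "https://www.imf.org/en/Publications/WP/Issues"  # hypothetical listing path

def listing_url(page: int) -> str:
    """URL of the nth listing page (the query parameter is an assumption)."""
    return f"{BASE}?page={page}"

def scrape(n_pages: int) -> list[dict]:
    # Imported here so the URL helper above stays dependency-free.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    # Step 1: collect the URL of every working paper from the listing pages.
    paper_urls = []
    for page in range(1, n_pages + 1):
        driver.get(listing_url(page))
        anchors = driver.find_elements(By.CSS_SELECTOR, "h6 a")  # assumed selector
        paper_urls += [a.get_attribute("href") for a in anchors]
    # Step 2: open each paper page and append its metadata.
    rows = []
    for url in paper_urls:
        driver.get(url)
        rows.append({
            "url": url,
            "title": driver.find_element(By.CSS_SELECTOR, "h1").text,          # assumed
            "summary": driver.find_element(By.CSS_SELECTOR, ".summary").text,  # assumed
        })
    driver.quit()
    return rows  # e.g. pd.DataFrame(rows) to form the metadata data frame
```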
Using the scraper, we collected metadata for 7,246 working papers published between 1990 and 2022. We chose to collect data for the working paper series (as opposed to other publications such as Staff Discussion Notes or Article IV country reports) because the series is available for the longest continuous period of time.
Figure 1: Wordcloud of IMF Working Papers
Chart 1 shows the number of papers collected for each year of our sample. Early in the sample period there were few papers to scrape, but the count increased sharply from the mid-1990s and stabilised at around 200-300 papers per annum from the mid-2000s.
Subject to availability, the scraper collected the following data for each paper:
The collection of paper summaries forms the main corpus for our paper. We use the summaries, rather than the full text of each paper, because they are computationally efficient to process and should capture the main content and findings of each article. Chart 2 shows that the majority of documents in our corpus have 100-200 words in the summary, which is manageable. We supplement this main corpus with corpora formed from the titles and (manually assigned) subject tags of the papers (see below), which are used to check the robustness of our results.
The subject tags assigned to papers provide another avenue for robustness checking. Chart 3 shows the distribution of subject tags across papers, with the majority of papers having fewer than 10 subject tags. However, the assignment of subject tags appears to occur on an ad hoc basis, so the tags required a significant amount of cleaning to distill them into a set of usable subjects. Chart 4 shows the most popular subject tags (after cleaning), by their frequency of appearance. The most popular subject tags correspond to topics firmly in the IMF’s remit: central banks, fiscal policy, exchange rates, banks and the financial sector, inflation, the labour market and the financial crisis.
Text pre-processing is not strictly required for our modeling approach and in some more complex natural language applications (such as machine translation or word prediction) may even be undesirable (Bricken 2021). However, topic modeling is a relatively simple NLP task and pre-processing can help improve the interpretability of our results. Intuitively, topic modeling simply groups documents based on words (or combinations of words) that are common across documents (and unique to that group). Words that are slight variations of each other should play the same role in describing topics and be represented by the same token. Furthermore, punctuation, symbols and stop words are unlikely to contribute to identifying topics in our corpus and simply increase the dimensionality of our data. With this in mind, we apply the following pre-processing steps to the summary and title fields:
For the subject tags, we use the same lemmatiser and n-grams to extract popular multi-word subject tags. To help with interpretability, we also manually consolidated some similar subject tags into groups (e.g. ‘exchange rates’ and ‘foreign exchange’ both map to exchange rates).
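A minimal sketch of this cleaning pipeline is below. The stop-word list, suffix rules and tag map are illustrative stand-ins for a full lemmatiser (such as spaCy's) and for our manually curated tag groups:

```python
# Minimal sketch of the cleaning applied to summaries, titles and tags.
# The stop words, suffix rules and tag map are illustrative stand-ins.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}
TAG_MAP = {"foreign exchange": "exchange rates"}  # consolidate similar tags

def lemmatise(token: str) -> str:
    # Crude suffix stripping as a stand-in for a real lemmatiser.
    for suffix, repl in (("ies", "y"), ("s", "")):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + repl
    return token

def preprocess(text: str) -> list[str]:
    """Lower-case, drop punctuation and stop words, then lemmatise."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatise(t) for t in tokens if t not in STOP_WORDS]

def bigrams(tokens: list[str]) -> list[str]:
    """Adjacent word pairs, used to surface multi-word terms."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def consolidate(tag: str) -> str:
    """Map a cleaned subject tag to its consolidated group."""
    return TAG_MAP.get(tag, tag)
```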
The transformer is a neural network architecture that has excelled in a range of natural language processing (NLP) tasks. Transformers use the concept of attention to learn short- and long-term dependencies between words in a sequence (Vaswani et al. 2017). A trained transformer outputs a vector of embeddings (in Euclidean space) for each vocabulary word, which is suitable for deployment in many downstream machine learning tasks. The embeddings contain a representation of a word’s frequency and, via the attention mechanism, its positional context.
The power of the transformer architecture is that it can be pre-trained on a large corpus of text, say all of Wikipedia, to produce word embeddings. The user can then ‘fine-tune’ the embeddings using their (comparatively tiny) corpus. This allows them to take advantage of embeddings trained on a corpus much larger than most would have access to for their NLP task.
The transformer we have chosen is the popular pre-trained BERT (Bidirectional Encoder Representations from Transformers) architecture (Devlin et al. 2018). BERT augments the transformer architecture from Vaswani et al. (2017) by making the context around a word of interest bi-directional (i.e. a word appearing to the left of the word of interest is treated differently to the same word appearing to the right). This allows the model to predict words that appear on both sides of a word of interest (in the original transformer architecture, the model could only predict words that occur after a word of interest).
In BERT, each word in the vocabulary actually contains multiple embeddings, which capture the different contexts in which a word might appear. This is attractive because it moves away from the ‘bag of words’ approach to text mining (which ignores the context around a word). The downside is that the embeddings from BERT are very high dimensional, which can be a problem for some downstream tasks.
There are many versions of BERT trained on different corpora in many languages. We use the ‘all-mpnet-base-v2’ model (HuggingFace 2022), which is a general-use, English-language version of BERT suitable for encoding sentences and small paragraphs. This model is trained on data from various sources, including Reddit, WikiAnswers and Stack Exchange. It outputs a set of 768-dimensional embeddings. According to the documentation, the embeddings are best used downstream in tasks such as sentence similarity, information retrieval or clustering. Since we will employ clustering to generate our topics, this is a suitable model for us.
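Encoding documents with this model can be sketched as follows (the sentence-transformers package is assumed; `encode_summaries` and `cosine_similarity` are our own helper names, and the model weights download on first use):

```python
# Sketch of encoding paper summaries into 768-dimensional embeddings.
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def encode_summaries(summaries):
    # Imported lazily: requires the sentence-transformers package.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(summaries)  # array of shape (n_documents, 768)
```

Downstream similarity tasks then reduce to comparing these vectors, e.g. with `cosine_similarity`.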
We deploy the fine-tuned embeddings from BERT on our corpus to generate a topic model, following the procedure outlined in Grootendorst (2022). This technique is very new and is inspired by the Top2Vec model (Angelov 2020).
Grootendorst (2022) outlines the steps as (summarised in Figure 3):
Dimension reduction: Our BERT word embeddings have 768 dimensions. Many clustering algorithms handle high-dimensional data poorly, since in high dimensions the ‘distance’ between all points tends to converge to the same value (Aggarwal, Hinneburg, and Keim 2001). To address this, Grootendorst uses the UMAP (Uniform Manifold Approximation and Projection) algorithm to project the embeddings onto a lower-dimensional manifold (McInnes, Healy, and Melville 2018). This algorithm is a good choice because it preserves much of the local structure of the high-dimensional data in the lower-dimensional representation.
Clustering: pass the reduced embeddings to the HDBSCAN (hierarchical DBSCAN) clustering algorithm. According to Grootendorst, HDBSCAN works well with the output from UMAP. The algorithm learns the optimal number of clusters and does not force an assignment for every observation (observations can be labelled as ‘outliers’). Forced assignment and a pre-specified cluster count are downsides of other clustering algorithms that can make their results challenging to interpret consistently. The clusters chosen by HDBSCAN form the ‘topics.’
Topic creation: to interpret topics, we need to distinguish between them in a meaningful way. Grootendorst constructs a class-based TF-IDF measure to handle this. This method combines all documents within each cluster into their own ‘class’ (so each cluster is represented by one concatenated ‘class’ document). Then, a TF-IDF representation is applied to the class-level documents. The words with the top TF-IDF score from each class summarise each topic. Using the top scoring TF-IDF words is a clever way to systematically identify topics, because these are the words that are most ‘unique’ in each cluster. Figure 2 summarises the class TF-IDF calculation.
Figure 2: Class-based TF-IDF
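The class-based TF-IDF calculation summarised in Figure 2 can be sketched from scratch. Following Grootendorst (2022), the score of word w in class c is tf(w, c) × log(1 + A / f(w)), where tf(w, c) is the count of w in the concatenated class-c document, f(w) its count across all classes, and A the average number of words per class (`class_tfidf` and the token-list input format are our own):

```python
# Class-based TF-IDF: score(w, c) = tf(w, c) * log(1 + A / f(w)).
import math
from collections import Counter

def class_tfidf(classes: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """classes maps a cluster name to the tokens of its concatenated documents."""
    counts = {c: Counter(tokens) for c, tokens in classes.items()}
    total = Counter()  # f(w): frequency of each word across all classes
    for cnt in counts.values():
        total.update(cnt)
    avg_words = sum(len(t) for t in classes.values()) / len(classes)  # A
    return {
        c: {w: tf * math.log(1 + avg_words / total[w]) for w, tf in cnt.items()}
        for c, cnt in counts.items()
    }
```

The top-scoring words per class are then read off with, for example, `max(scores[c], key=scores[c].get)`.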
Topic reduction: Although HDBSCAN chooses the number of topics automatically, topics may overlap significantly or there may be too many topics for the user to interpret meaningfully. We can identify overlapping topics using an ‘intertopic distance map,’ which projects the embeddings into a two-dimensional space (using UMAP) and then plots topics according to their cluster centroids (the size of a topic in the map reflects the number of documents assigned to it). Based on this, one can decide whether to reduce topics. One way is to force HDBSCAN to use a certain number of topics. Another is to combine topics where the average document in each cluster has a cosine similarity above some threshold (the default is 0.9) (Angelov 2020). We reduce the number of topics via the cosine similarity method.
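The cosine-similarity reduction can be sketched from scratch: merge the most similar pair of topic centroids, replace them with their mean, and repeat until no pairwise similarity exceeds the threshold. Representing a merged topic by the mean of its centroids is our simplification; the actual implementation recomputes representations from the combined documents:

```python
# Sketch of recursive topic merging by cosine similarity of centroids.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def reduce_topics(centroids, threshold=0.9):
    """Return groups of original topic indices after recursive merging."""
    groups = [[i] for i in range(len(centroids))]
    vecs = [list(c) for c in centroids]
    while True:
        best, pair = threshold, None
        for i in range(len(vecs)):
            for j in range(i + 1, len(vecs)):
                sim = cosine(vecs[i], vecs[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if pair is None:  # nothing left above the threshold
            return groups
        i, j = pair
        vj, gj = vecs.pop(j), groups.pop(j)
        vecs[i] = [(a + b) / 2 for a, b in zip(vecs[i], vj)]  # simplified merge
        groups[i] += gj
```

Because merged centroids can in turn become similar to further topics, the procedure can combine more topics than the initial pairwise counts suggest.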
Maximal marginal relevance (MMR): once the topics and their word representations are complete, an optional step is to assess the coherence of the topic descriptions. A common issue is that the top words can be very similar (even after we have lemmatised). This is especially so when using bigrams. For example, in a topic about exchange rates three of the top terms might be exchange rate, exchange and rate! Clearly, this makes some of the top terms redundant for interpretation. One way to correct this problem is to apply maximal marginal relevance. This is a ranking technique which, given a list of terms, attempts to balance between their similarity and diversity (Carbonell and Goldstein 1998).
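MMR can be sketched as follows: at each step, pick the candidate term that best trades off similarity to the topic against similarity to the terms already chosen. We follow the convention where the diversity parameter equals 1 − λ from Carbonell and Goldstein (1998); the function and argument names are our own:

```python
# Sketch of maximal marginal relevance over candidate term embeddings.
import math

def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mmr(topic_vec, terms: dict, k: int, diversity: float = 0.7) -> list[str]:
    """terms maps each candidate word to its embedding; returns k chosen words."""
    selected = []
    candidates = dict(terms)
    while candidates and len(selected) < k:
        def score(w):
            relevance = _cos(candidates[w], topic_vec)
            # Penalty: similarity to the most similar already-selected term.
            redundancy = max((_cos(candidates[w], terms[s]) for s in selected),
                             default=0.0)
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.pop(best)
    return selected
```

With a high diversity value, a near-duplicate of an already-selected term scores poorly even if it is highly relevant, which is exactly the ‘exchange rate / exchange / rate’ redundancy being corrected.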
Dynamic topics: finally, we add a dynamic element to the topic model by specifying a fixed number of periods (time stamps) to split the data over. To calculate the time-specific representations of a topic, term frequencies are calculated for the documents in each topic and time period t. These are then averaged with the global (all time) and period t-1 class TF-IDF representations to help with stability and persistence in the dynamic process. The benefit of this approach is that we get a distinct set of topic words for each topic in each time slice.
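The smoothing across time slices can be sketched as follows. Using equal weights for the current, global and previous-slice scores is our assumption for illustration; the actual implementation exposes the global and evolutionary averaging as separate tuning options:

```python
# Sketch of evolutionary smoothing for dynamic topic representations:
# each slice's term scores are averaged with the global scores and, where
# available, the previous (smoothed) slice's scores. Equal weights assumed.
def smooth_dynamic(slice_scores: list[dict], global_scores: dict) -> list[dict]:
    smoothed = []
    prev = None
    for scores in slice_scores:
        out = {}
        for word, s in scores.items():
            parts = [s, global_scores.get(word, 0.0)]
            if prev is not None:
                parts.append(prev.get(word, 0.0))
            out[word] = sum(parts) / len(parts)
        smoothed.append(out)
        prev = out
    return smoothed
```

Averaging toward the global and previous representations damps period-to-period swings, giving each topic a distinct but stable word set per time slice.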
Figure 3: BERTopic algorithm
We pass our pre-processed IMF summaries through the BERTopic algorithm outlined above and set parameters as follows:
For each document, the algorithm outputs a vector of probabilities corresponding to each topic. Topics are assigned to documents according to the highest probability. Table 1 shows the number of documents assigned to the top 10 topics. The highest-scoring class TF-IDF words describe each topic. Topic -1 contains all the documents not assigned to a topic. The initial run of the algorithm outputs 65 topics overall, with around one third of the documents assigned to the outlier topic.
Each topic in Table 1 is very coherent and relates mostly to international macroeconomics, which comes as no surprise. IMF Working Papers most commonly discuss exchange rates, monetary policy, and fiscal policy and sovereign debt. It is also interesting to see topics described by income inequality and climate change feature in the top 10, despite these being more contemporary issues.
| Topic | Count | Name |
|---|---|---|
| -1 | 2670 | -1_financial_policy_paper_bank |
| 0 | 328 | 0_exchange_exchange rate_rate_real exchange |
| 1 | 296 | 1_inflation_monetary_monetary policy_policy |
| 2 | 246 | 2_tax_revenue_vat_income |
| 3 | 237 | 3_debt_sovereign_bond_spread |
| 4 | 235 | 4_fiscal_fiscal policy_consolidation_rule |
| 5 | 215 | 5_wage_labor_unemployment_labor market |
| 6 | 153 | 6_inequality_income_poverty_education |
| 7 | 123 | 7_shock_cycle_business cycle_business |
| 8 | 110 | 8_climate_subsidy_emission_carbon |
Table 1: Topic assignments
Figure 4 visualises the topics using an intertopic distance map. The intertopic distance map projects the average embedding for each topic into a two-dimensional space (using UMAP). The size of each topic ‘bubble’ represents the number of documents assigned to the topic. Topics that appear closer together on the map are ‘more similar’ (in the sense that their embeddings have a higher cosine similarity). Since many of the topics overlap, you can click and drag on a cluster of topics to explore a particular area of the map in more detail.
Despite a degree of overlap between the topics, the clusters reveal that most of the 65 topics summarise a particular economic issue coherently. Additionally, the overlapping topics represent related (but also distinct) economic issues.
For example, the cluster of topics in the bottom left of the map is about fiscal policy, sovereign debt and taxes, but also commodity price shocks, oil and climate change. Each of these topics is distinct, but they are also related and are therefore likely to use similar language. One could summarise that working papers in this cluster of topics commonly relate to governments that rely on resource extraction for export and tax revenues. A common issue these governments (often in emerging markets) face is how to manage the sustainability of fiscal policy and debt servicing when they are exposed to commodity price shocks. A more contemporary question has been how these countries will adapt to the transition away from fossil fuels prompted by climate change.
Similar narratives appear for the other clusters of related topics. Some examples are aid, remittances, natural disasters and IMF support programs (bottom middle), demography, pensions and savings (bottom middle left) or trade, financial integration, dollarisation, foreign currency and capital flows (also bottom middle left).
Figure 4: Intertopic distances
Even though many of the overlapping topics have distinct interpretations, some are still very similar. Figure 5 shows the distribution of pairwise cosine similarities across different topics (calculated from their embeddings). We can threshold the cosine similarity to reduce the number of topics: all topics with a pairwise score above the threshold (which we set at 0.9) are combined, and the procedure repeats until no similarities remain above the threshold. This recursion explains why, despite observing few similarity scores above 0.9 in Figure 5, applying the technique reduces the number of topics from 65 to 32.
Figure 5: Distribution of cosine similarities
The next step is to choose the number of words used to describe each topic. To do this, we apply the ‘elbow’ method to the class TF-IDF scores (term scores). Figure 7 plots the term scores for the top words in each topic. The idea is to choose the number of top words based on the inflection point where the marginal contribution of an additional word becomes roughly constant (the elbow). From Figure 7 this appears to happen for most topics at between 4 and 6 words, so we describe each topic by its top 5 words.
Figure 7: Term scores
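One simple way to automate the elbow choice is to take the point where the drop between successive term scores first falls below some fraction of the largest drop. This exact rule is our illustrative heuristic, not part of the algorithm we use:

```python
# Heuristic elbow detection on descending class TF-IDF term scores:
# stop once the marginal drop falls below a fraction of the largest drop.
def elbow(scores: list[float], frac: float = 0.2) -> int:
    """scores: term scores sorted descending. Returns a word count."""
    drops = [a - b for a, b in zip(scores, scores[1:])]
    biggest = max(drops)
    for k, d in enumerate(drops, start=1):
        if d < frac * biggest:
            return k
    return len(scores)
```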
Figure 9 shows the top words for the same set of topics after applying MMR (with a relatively high value for the diversity parameter to ensure a diverse representation of words). MMR has consolidated the top words for all of the topics with repeated top words, allowing a richer interpretation of the topics using the same number of key words. For instance, for our topic on exchange rates we can now see that the research relates to exchange rate regimes and (central bank) intervention in the foreign currency market. For the topic on labour markets, we can now posit that IMF research has often related to reform (which is more specific to the IMF’s remit).
Figure 9: Top words per topic (after MMR)
Finally, we segment each topic into time slices. The dynamic topic model requires us to specify the number of time slices to apply. We choose 10 time slices, corresponding to segments of around 3 years. All else equal, longer segments help to reduce within-topic volatility, especially for smaller topics. The trade-off is having fewer documents to represent a topic within a particular time slice. This forms the final output from the model, which we analyse in the next section. The dynamic topic model over all topics is Figure A2 in the appendix.
In this section we use the dynamic topic representations to discuss two instances of how IMF Working Papers have attended to various economic issues over time. Given the 65 topics, one could construct many case studies involving groups of topics; we have chosen to discuss only two as an illustration. We observe that IMF Working Papers appear largely reactive to economic events, which makes sense as they are longer-term pieces of research.
Figure 10 shows selected topics related to exchange rates, fiscal and monetary policy and banking. There is a clear spike in research regarding exchange rates, foreign exchange intervention and sovereign debt from the late 1990s to early 2000s, around the time of the Asian Financial Crisis. The Asian Financial Crisis started in 1997 when Thailand unpegged its currency from the US Dollar, leading to large capital outflows from emerging Asian economies and significant depreciation in their currencies. In response to the financial crisis, the IMF provided support packages to the Indonesian, Korean and Thai economies.
Separately, we can also see an uptick in research about stress testing and banking crises in the years around the Global Financial Crisis (GFC). These coincide with greater regulatory scrutiny of the banking sector following the GFC. For the topic on monetary policy, the increase in papers around this period could reflect central banks deploying unconventional monetary policies (e.g. quantitative easing) in response to the crisis, many for the first time in the inflation-targeting era. For sovereign debt, we see that the peak in research persists for around a decade, which makes sense given that the European sovereign debt crisis followed the GFC in the early 2010s.